[draft] [fix] [client] Fix IO buffer overflow when resend msg after producer reconnect #21351
Conversation
@@ -2284,12 +2285,18 @@ private void recoverProcessOpSendMsgFrom(ClientCnx cnx, MessageImpl from, long e
    if (stripChecksum) {
        stripChecksum(op);
    }
    // To avoid IO buffer overflow, split to multi-flush.
    if (messageBytesSizeInCache + op.cmd.readableBytes() < messageBytesSizeInCache) {
        cnx.ctx().flush();
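For illustration, here is a minimal, self-contained sketch of the chunked-flush idea in plain Netty terms. It is not the actual Pulsar change: `ChunkedResend`, `resend`, and the `FLUSH_CHUNK_BYTES` threshold are hypothetical names, and the PR itself triggers the flush from the counter check shown in the diff rather than a fixed chunk size.

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelHandlerContext;
import java.util.List;

// Minimal sketch of the chunked-flush idea, assuming a hypothetical
// FLUSH_CHUNK_BYTES threshold instead of the counter check used in the PR.
final class ChunkedResend {
    private static final long FLUSH_CHUNK_BYTES = 8L * 1024 * 1024; // assumed 8 MiB chunks

    static void resend(ChannelHandlerContext ctx, List<ByteBuf> pendingCmds) {
        long bytesSinceLastFlush = 0;
        for (ByteBuf cmd : pendingCmds) {
            bytesSinceLastFlush += cmd.readableBytes();
            ctx.write(cmd, ctx.voidPromise());
            // Flush periodically so the whole backlog is never queued in one go.
            if (bytesSinceLastFlush >= FLUSH_CHUNK_BYTES) {
                ctx.flush();
                bytesSinceLastFlush = 0;
            }
        }
        ctx.flush();
    }
}
```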
`flush` is an async operation, so this might not work as expected.
Since `write` and `flush` are executed on the same thread, the `write` and `flush` actions keep their order of execution, so it will work as expected.
I mean it from a different aspect. I guess that in this case the purpose of flushing is to ensure that buffers don't overflow. It feels like the solution in this PR won't fully address that, since all operations will be queued up in any case unless the logic that calls `write` also runs in the ctx event loop (thread). Another possibility would be to somehow wait for the flush to complete before writing more.
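If waiting for flush completion were the chosen route, one possible sketch (not Pulsar code; `AwaitedResend` and `chunkSize` are made up for illustration) would track the `ChannelFuture` of the last write in each chunk and block on it, provided this logic does not run on the event loop thread itself:

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.ChannelFuture;
import io.netty.channel.ChannelHandlerContext;
import java.util.List;

// Hypothetical helper: write a chunk, flush, then wait for the last write of
// the chunk to complete before continuing. Must NOT run on the event loop
// thread, or awaitUninterruptibly() would deadlock.
final class AwaitedResend {
    static void resend(ChannelHandlerContext ctx, List<ByteBuf> pendingCmds, int chunkSize) {
        ChannelFuture lastWrite = null;
        int inChunk = 0;
        for (ByteBuf cmd : pendingCmds) {
            lastWrite = ctx.write(cmd);
            if (++inChunk == chunkSize) {
                ctx.flush();
                // Block until this chunk has actually been written out.
                lastWrite.awaitUninterruptibly();
                inChunk = 0;
            }
        }
        ctx.flush();
        if (lastWrite != null) {
            lastWrite.awaitUninterruptibly();
        }
    }
}
```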
In Netty, there's the `Channel.isWritable` method and the `channelWritabilityChanged` callback that help keep the queued writes bounded (see the low/high watermark options). However, IIRC, the Pulsar code base doesn't show examples of how write logic could be implemented to take advantage of this type of backpressure solution.
Optimally, the logic would be implemented so that more writes are added while the channel is writable and paused otherwise; adding writes should resume after the channel becomes writable again.
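A possible shape for such watermark-based backpressure, sketched with a hypothetical `WritabilityAwareResender` handler (simplified, no thread-safety handling, and not existing Pulsar code):

```java
import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelDuplexHandler;
import io.netty.channel.ChannelHandlerContext;
import java.util.Queue;

// Keep writing while the channel is writable, stop once the outbound buffer
// crosses the high watermark, and resume from channelWritabilityChanged().
final class WritabilityAwareResender extends ChannelDuplexHandler {
    private final Queue<ByteBuf> pending;

    WritabilityAwareResender(Queue<ByteBuf> pending) {
        this.pending = pending;
    }

    void drain(ChannelHandlerContext ctx) {
        Channel ch = ctx.channel();
        ByteBuf cmd;
        // isWritable() turns false when the outbound buffer exceeds the high watermark.
        while (ch.isWritable() && (cmd = pending.poll()) != null) {
            ctx.write(cmd, ctx.voidPromise());
        }
        ctx.flush();
    }

    @Override
    public void channelWritabilityChanged(ChannelHandlerContext ctx) throws Exception {
        // The outbound buffer dropped below the low watermark: resume writing.
        if (ctx.channel().isWritable()) {
            drain(ctx);
        }
        super.channelWritabilityChanged(ctx);
    }
}
```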
I don't think that flushing will help resolve the issue. As noted above, Netty's `Channel.isWritable` / `channelWritabilityChanged` watermark mechanism would be the proper way to keep the queued writes bounded.
I'm still not sure how it gets to 2 GB of accumulated data. The client memory limit backpressure should have blocked long before the 2 GB.
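One possible explanation for the 2 GB figure, assuming `messageBytesSizeInCache` is declared as an `int`: the `a + b < a` check in the diff only becomes true once the running total wraps past `Integer.MAX_VALUE` (about 2 GiB), so the extra flush would not trigger until roughly that point. A toy illustration:

```java
// Toy illustration of the overflow-check idiom used in the diff: for int
// counters, `a + b < a` only holds after the sum wraps past Integer.MAX_VALUE.
public final class OverflowCheckDemo {
    public static void main(String[] args) {
        int counted = Integer.MAX_VALUE - 10; // ~2 GiB already accumulated
        int next = 100;                       // next message size in bytes
        System.out.println(counted + next < counted); // true: the int wrapped around
    }
}
```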
Motivation
If many messages are submitted via `sendAsync` while the producer is reconnecting, then after the connection is re-established all of the messages cached in memory are written and flushed to the socket at once, and the IO buffer overflows.
Modifications
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: x